The real-time Air Quality Index of monitoring stations across India is obtained from the API provided by the Ministry of Environment & Forests and the Central Pollution Control Board, India.
The API returns real-time data for the current hour, covering monitored pollutants such as Carbon Monoxide (CO), Sulphur Dioxide (SO2), Nitrogen Dioxide (NO2), Particulate Matter (PM10 and PM2.5), and Ozone (O3).
The stations monitor air pollution from different sectors, viz. industrial, residential, vehicular, environment, and forest.
The source of the data is the real-time Air Quality Index from various locations.
The API returns the current hour's Air Quality Index for various stations. I started collecting data on 23/02/2020 by adding a cron-job scheduler on a Linux server; the collection runs until 28th March 2020.
I wrote a short shell script, run as a cron job on a Linux machine, that fetches the hourly data every day and appends it to a master CSV file, which is then used for the analysis.
Bash Script (Script also attached in the Project Folder):
%%script false
#!/bin/bash
var=$(date +%s)
#var is a temporary variable holding the current Unix timestamp in seconds, giving a unique temporary file name for each hour.
myPATH="/home/"
#myPATH is the folder for storing the temporary as well as the master CSV file
echo $var
wget -O "${myPATH}${var}.csv" "https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=YOUR_API_KEY&format=csv&offset=0&limit=10000"
#wget stores the response at myPATH/var.csv as a temporary file every hour the cron job is run
cat "${myPATH}${var}.csv" >> "${myPATH}AirQualityData.csv"
#Finally, the temporary file is appended to the master CSV raw data file
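With the script saved on the server, an hourly crontab entry along the following lines would run it at the top of every hour. The script path and log path here are hypothetical; the original paths were not given.

```shell
# Edit the crontab with `crontab -e` and append a line like this
# (minute 0 of every hour; script and log paths are placeholders):
0 * * * * /home/fetch_air_quality.sh >> /home/fetch_air_quality.log 2>&1
```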
API URL with the key registered on the portal:
Another approach to fetching the same data is to run the Python script below as a cron job:
## All import statements for data pre-processing: collection, manipulation, transformation.
import csv
import requests
import pandas as pd
#The code below can also be used to fetch raw data from the API data source by adding
#the Python script file for the function below to the cron-job scheduler.
api_url = "https://api.data.gov.in/resource/"
resource = "3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69"
#api_key is obtained after registering on the portal: https://api.data.gov.in
api_key = "YOUR_API_KEY"
csv_url = api_url+resource+"?api-key="+api_key+"&format=csv&offset=0&limit=10000"
#csv-url with appropriate link + resource id + api key and the desired format
# request_data() function will create a session with the csv_url and fetch the data from the Data source in utf-8 format
def request_data():
    with requests.Session() as mySession:
        download = mySession.get(csv_url)
        current_data = download.content.decode('utf-8')
        # Append the current session's data to the existing AirQuality master file each time the job is run
        with open("AirQualityData_CurrentSample_Collection.csv", 'a') as file_append:
            file_append.write(current_data)

request_data()
# The raw data file AirQualityData.csv accumulates the appended hourly data every day.
# The header rows that are appended on every write need to be removed from the file.
def read_raw_data(raw_file):
    count = 0
    data_headers = ""
    with open(raw_file) as fin:
        clean_fin = open("AirQualityData_Formatted.csv", "w", newline="")
        for line in fin:
            if line[:2] == "id" and count == 0:
                count = 1
                data_headers = line
                clean_fin.write(data_headers)
            elif line[:2] != "id":
                clean_fin.write(line)
        clean_fin.close()
# The above function ensures that the raw data file is read and only lines with the actual data are kept along with a single
# header for the entire file
raw_file = 'Data/AirQualityData.csv'
read_raw_data(raw_file)
#The formatted data is written to a new file- AirQualityData_Formatted.csv
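The header-stripping logic above can be illustrated on a tiny in-memory sample. This is a sketch with invented rows (the station name and values are hypothetical), mirroring the filter in read_raw_data:

```python
import io

# Two simulated API pulls appended together, each carrying its own header row.
raw = (
    "id,station,pollutant_id,pollutant_avg\n"
    "1,Alipur,PM2.5,70\n"
    "id,station,pollutant_id,pollutant_avg\n"
    "2,Alipur,PM10,110\n"
)

def strip_repeated_headers(lines):
    """Keep the first header line and every data line, as read_raw_data does."""
    out, seen_header = [], False
    for line in lines:
        if line.startswith("id"):
            if not seen_header:
                out.append(line)
                seen_header = True
        else:
            out.append(line)
    return out

cleaned = strip_repeated_headers(io.StringIO(raw))
print("".join(cleaned))  # one header followed by the two data rows
```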
Reading the formatted file into a pandas dataframe
AQ_df=pd.read_csv('AirQualityData_Formatted.csv', engine="python", encoding="utf-8", error_bad_lines=False)
AQ_df.head()
AQ_df.tail()
AQ_df.describe()
As the data returned by the API is hourly, a glimpse of the data above shows that its time stamp is ambiguous: the last_update column carries the time stamp for the collected data, but does not specify whether it is the day time or the evening time (AM/PM).
# Sorting the dataframe by pollutant_id and station name to make the last_update column consistent
AQ_df=AQ_df.sort_values(by=["pollutant_id", "station"])
#Adding a time_stamp column which will be used in the analysis; it will later be converted to the date-time data type
AQ_df["time_stamp"] = ""
myDictionary = {}
# There is a unique combination of last_update, pollutant_id, station for every hour.
# This unique combination can be used to identify the AM/PM (Morning/Evening) records.
# The data collection was started in the morning, so the tagging is accurate for the first day; for subsequent
# data the date-time is identified and the time_stamp column is updated accordingly.
counter = 0
for index, row in AQ_df.iterrows():
    val = str(row["last_update"]) + "_" + str(row["pollutant_id"]) + "_" + str(row["station"])
    counter += 1
    if val in myDictionary.keys():
        # Second occurrence of the unique combination: append PM at the end
        ts = str(row["last_update"])
        AQ_df.at[index, "time_stamp"] = ts + " PM"
        if counter <= 10:
            print("Transforming date data for: ", val)
    else:
        # First occurrence of the unique combination: add it to the dictionary and append AM at the end
        myDictionary.update({val: val})
        ts = str(row["last_update"])
        AQ_df.at[index, "time_stamp"] = ts + " AM"
        if counter <= 10:
            print("Transforming date data for: ", val)
        elif counter == 11:
            print("Output limited to 10 prints transforming rows....")
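The duplicate-key rule driving the loop above can be shown in isolation. A toy sketch with invented rows (station name and time stamp are hypothetical): the first time a (last_update, pollutant, station) key is seen it is tagged AM, and a repeat of the same key is tagged PM.

```python
rows = [
    ("23-02-2020 09:00:00", "PM2.5", "Alipur"),
    ("23-02-2020 09:00:00", "PM2.5", "Alipur"),  # same key again -> evening record
]
seen = set()
tagged = []
for ts, pollutant, station in rows:
    key = (ts, pollutant, station)
    if key in seen:
        tagged.append(ts + " PM")   # second occurrence of the key
    else:
        seen.add(key)
        tagged.append(ts + " AM")   # first occurrence of the key
print(tagged)  # ['23-02-2020 09:00:00 AM', '23-02-2020 09:00:00 PM']
```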
AQ_df.head()
# Changing the data type of the new time-stamp column to datetime
AQ_df["time_stamp"] = pd.to_datetime(AQ_df["time_stamp"], dayfirst=True)
# Drop the columns id and pollutant_unit as they do not have any significance
AQ_df=AQ_df.drop(['id', 'pollutant_unit'], axis=1)
AQ_df.head()
# The source column last_update can be dropped from the dataframe now:
# AQ_df=AQ_df.drop(["last_update"], axis=1)
# Resetting the index of the data:
AQ_df.reset_index(drop=True, inplace=True)
#Description of data:
AQ_df.describe()
AQ_df.head()
AQ_df.tail()
#Number of Null/NA values in the dataset
AQ_df.isnull().sum()
Handling the NA/Missing values in the data frame
The NA/missing values in the dataset are handled after carefully studying the dataset.
NA values are replaced with the mean over progressively broader groupings of the data, to maintain the maximum possible accuracy.
If values are still missing after all of these groupings, it is quite possible that the station never collects that data; for example, a few stations are not configured to record the levels of all seven pollutants.
# Adding date and time columns separately to the dataframe, which can be crucial in further analysis
AQ_df["date"] = AQ_df["time_stamp"].dt.date
AQ_df["time"] = AQ_df["time_stamp"].dt.time
print("Number of missing values in the data set before processing: ")
print("***************************************************************** ")
print(AQ_df.isnull().sum())
print("*****************************************************************")
# Aggregates and fills null values considering the same time, station and pollutant type
AQ_df["pollutant_min"] = AQ_df.groupby(["station", "pollutant_id", "time"])["pollutant_min"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_max"] = AQ_df.groupby(["station", "pollutant_id", "time"])["pollutant_max"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_avg"] = AQ_df.groupby(["station", "pollutant_id", "time"])["pollutant_avg"].transform(lambda x: x.fillna(x.mean()))
# Aggregates and fills null values considering the same day, station and pollutant type
AQ_df["pollutant_min"] = AQ_df.groupby(["station", "pollutant_id", "date"])["pollutant_min"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_max"] = AQ_df.groupby(["station", "pollutant_id", "date"])["pollutant_max"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_avg"] = AQ_df.groupby(["station", "pollutant_id", "date"])["pollutant_avg"].transform(lambda x: x.fillna(x.mean()))
# Aggregates and fills null values considering the same station and pollutant type
AQ_df["pollutant_min"] = AQ_df.groupby(["station", "pollutant_id"])["pollutant_min"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_max"] = AQ_df.groupby(["station", "pollutant_id"])["pollutant_max"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_avg"] = AQ_df.groupby(["station", "pollutant_id"])["pollutant_avg"].transform(lambda x: x.fillna(x.mean()))
# Aggregates and fills null values considering the same time-stamp(day and time), city and pollutant type
AQ_df["pollutant_min"] = AQ_df.groupby(["city", "pollutant_id", "time_stamp"])["pollutant_min"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_max"] = AQ_df.groupby(["city", "pollutant_id", "time_stamp"])["pollutant_max"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_avg"] = AQ_df.groupby(["city", "pollutant_id", "time_stamp"])["pollutant_avg"].transform(lambda x: x.fillna(x.mean()))
# Aggregates and fills null values considering the same time, city and pollutant type
AQ_df["pollutant_min"] = AQ_df.groupby(["city", "pollutant_id", "time"])["pollutant_min"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_max"] = AQ_df.groupby(["city", "pollutant_id", "time"])["pollutant_max"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_avg"] = AQ_df.groupby(["city", "pollutant_id", "time"])["pollutant_avg"].transform(lambda x: x.fillna(x.mean()))
# Aggregates and fills null values considering the same day, city and pollutant type
AQ_df["pollutant_min"] = AQ_df.groupby(["city", "pollutant_id", "date"])["pollutant_min"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_max"] = AQ_df.groupby(["city", "pollutant_id", "date"])["pollutant_max"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_avg"] = AQ_df.groupby(["city", "pollutant_id", "date"])["pollutant_avg"].transform(lambda x: x.fillna(x.mean()))
# Aggregates and fills null values considering the same city and pollutant type
AQ_df["pollutant_min"] = AQ_df.groupby(["city", "pollutant_id"])["pollutant_min"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_max"] = AQ_df.groupby(["city", "pollutant_id"])["pollutant_max"].transform(lambda x: x.fillna(x.mean()))
AQ_df["pollutant_avg"] = AQ_df.groupby(["city", "pollutant_id"])["pollutant_avg"].transform(lambda x: x.fillna(x.mean()))
print("Number of missing values in the data set after processing: ")
print("*****************************************************************")
print(AQ_df.isnull().sum())
print("*****************************************************************")
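The fill passes above repeat one groupby/transform pattern over several (grouping, column) pairs, from narrowest to broadest. A compact sketch of the idea on a tiny hypothetical frame (station names and values invented, a single grouping shown for brevity):

```python
import numpy as np
import pandas as pd

# Minimal stand-in for AQ_df with two stations and some missing averages.
df = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "pollutant_id": ["CO", "CO", "CO", "CO"],
    "pollutant_avg": [1.0, np.nan, 2.0, np.nan],
})

value_cols = ["pollutant_avg"]
groupings = [["station", "pollutant_id"]]  # the notebook chains seven such groupings
for keys in groupings:
    for col in value_cols:
        # Fill each NaN with the mean of its group; broader groupings
        # would catch whatever the narrower ones could not.
        df[col] = df.groupby(keys)[col].transform(lambda x: x.fillna(x.mean()))

print(df["pollutant_avg"].tolist())  # [1.0, 1.0, 2.0, 2.0]
```

Expressing the passes as data rather than repeated statements keeps the fallback order explicit and easy to change.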
# Drops the null value from the data frame as there is no data collected and no further use in the analysis.
AQ_df=AQ_df.dropna()
print("Final number of missing values in the data set after processing and removing records with missing data: ")
print("*****************************************************************")
print(AQ_df.isnull().sum())
print("*****************************************************************")
As the data is now cleaned, we can analyse it further to gain some insights into the Air Quality Index of India.
Before we proceed with the analysis, the image below gives an overview of the parameters for evaluating the Air Quality Index depending on the pollutant levels:
Image Source: https://cpcb.nic.in/
byState=AQ_df.groupby(["state", "date"], as_index=False)["pollutant_avg"].mean()
avgbyState=byState.groupby(["state"])["pollutant_avg"].mean()
avgbyState=avgbyState.to_frame()
print("-----------------------------------------------------------------")
print("\033[0m"+"State with maximum average pollution daily (considering all Pollutants): ")
print("\033[1m"+"\033[91m"+avgbyState["pollutant_avg"].idxmax()," : ", avgbyState["pollutant_avg"].max())
print("\033[0m"+"-----------------------------------------------------------------")
print("State with minimum average pollution daily (considering all Pollutants): ")
print("\033[1m"+"\033[92m"+avgbyState["pollutant_avg"].idxmin()," : ", avgbyState["pollutant_avg"].min())
byCity=AQ_df.groupby(["city","date"], as_index=False)["pollutant_avg"].mean()
avgbyCity=byCity.groupby(["city"])["pollutant_avg"].mean()
avgbyCity=avgbyCity.to_frame()
print("\033[0m"+"-----------------------------------------------------------------")
print("City with maximum average pollution daily (considering all Pollutants): ")
print("\033[1m"+"\033[91m"+avgbyCity["pollutant_avg"].idxmax()," : ", avgbyCity["pollutant_avg"].max())
print("\033[0m"+"-----------------------------------------------------------------")
print("City with minimum average pollution daily (considering all Pollutants): ")
print("\033[1m"+"\033[92m"+avgbyCity["pollutant_avg"].idxmin(), " : ", avgbyCity["pollutant_avg"].min())
print("\033[0m"+"-----------------------------------------------------------------")
Insights:
The cell above gives a general idea of the states and cities with the highest and lowest average pollution across all pollutants. Because all pollutants are mixed together, the figures cannot be compared on any single scale or unit of pollutant.
It simply summarises how cities and states rank in pollution relative to other cities and states in India.
# All import statements used in creating visualisations
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from scipy import stats
import matplotlib
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
avgbyState=avgbyState.sort_values('pollutant_avg', ascending=True)
plt.figure(figsize=(12,8))
plt.title("Average Daily Pollution by State", fontsize=20, fontweight="bold")
plt.barh(avgbyState.index, avgbyState["pollutant_avg"], align="center", color="indianred")
plt.ylabel("State", fontsize=14)
plt.xlabel("Average Pollution Level", fontsize=14)
Insights:
The bar chart above compares pollution levels across the Indian states that collect the Air Quality Index at various stations.
The states with the highest pollution levels are clearly those with the highest population and a large number of industries, whereas the states with the lowest pollution levels and a good Air Quality Index are those with extensive vegetation and forests and sparsely populated regions of India.
AQ_df["dayofweek"] = AQ_df["time_stamp"].dt.dayofweek
avgbyDay=AQ_df.groupby(["dayofweek", "pollutant_id"], as_index=False)["pollutant_avg"].mean()
avgbyDay=avgbyDay.pivot(index="dayofweek", columns="pollutant_id", values="pollutant_avg")
plt.figure(figsize=(15,8))
plt.plot(avgbyDay.index, avgbyDay["CO"], label="CO")
plt.plot(avgbyDay.index, avgbyDay["NH3"], label="NH3")
plt.plot(avgbyDay.index, avgbyDay["SO2"], label="SO2")
plt.plot(avgbyDay.index, avgbyDay["OZONE"], label="OZONE")
plt.plot(avgbyDay.index, avgbyDay["PM2.5"], label="PM2.5")
plt.plot(avgbyDay.index, avgbyDay["PM10"], label="PM10")
plt.plot(avgbyDay.index, avgbyDay["NO2"], label="NO2")
plt.xlabel('Day of Week', fontsize=14)
plt.ylabel('Average Pollution', fontsize=14)
plt.title('Average Pollution Daily', fontsize=20, fontweight="bold")
plt.legend()
x_labels= ["Monday", "Tuesday", "Wednesday","Thursday","Friday","Saturday","Sunday"]
plt.xticks(np.arange(0,7),labels=x_labels)
plt.show()
Insights:
The line chart above shows how the average Air Quality Index of all cities varies across the days of the week; each line denotes a different pollutant. In general, pollution levels on weekdays are higher than on weekends.
From this line graph we can identify NH3 as the pollutant contributing least to the overall pollution index, whereas PM2.5 and PM10 are the major contributors.
avgbyDay_hm = pd.DataFrame(columns=avgbyDay.columns, index=avgbyDay.index)
for col in avgbyDay_hm.columns:
    test = stats.zscore(avgbyDay[[col]])
    avgbyDay_hm[col] = pd.DataFrame(test)
# z-score normalisation is used to normalise the avg_pollutant column
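What z-score normalisation does here, on a minimal example: each column is rescaled to mean 0 and unit variance, so pollutants measured on very different scales become comparable in the heatmap.

```python
import numpy as np
from scipy import stats

# A single toy column of readings (values invented).
col = np.array([10.0, 20.0, 30.0])
z = stats.zscore(col)  # (x - mean) / population standard deviation
print(np.round(z, 4))  # [-1.2247  0.      1.2247]
```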
sns.set()
f, ax = plt.subplots(figsize=(15, 12))
daysofWeek= ["Monday", "Tuesday", "Wednesday","Thursday","Friday","Saturday","Sunday"]
sns.heatmap(avgbyDay_hm, annot=False, linewidths=0, ax=ax, cmap="YlOrRd",yticklabels=daysofWeek)
plt.yticks(rotation=0)
plt.xlabel('Pollutant', fontsize=14)
plt.ylabel('Day of Week', fontsize=14)
plt.title('Average Pollution Levels by Day', fontsize=20, fontweight="bold")
Insights:
The previous line chart compared each pollutant's national average for every day of the week.
This heatmap uses data normalised with the z-score, so that all pollutants can be compared on a similar scale relative to their levels on the other days.
The heatmap clearly indicates that pollution levels are higher on weekdays and lower on weekends for all pollutants.
avgbyTime=AQ_df.groupby(["time", "pollutant_id"], as_index=False)["pollutant_avg"].mean()
avgbyTime=avgbyTime.pivot(index="time", columns="pollutant_id", values="pollutant_avg")
#Evaluating the pollution level by hourly basis with respect to all pollutants.
avgbyTime_hm=pd.DataFrame(stats.zscore(avgbyTime), columns=avgbyTime.columns)
# Getting z-score normalised results of the avg pollution level for each hour.
sns.set()
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(avgbyTime_hm, annot=False, linewidths=0, ax=ax, cmap="bwr")
plt.yticks(rotation=0)
plt.ylabel('Time', fontsize=14, fontweight="bold")
plt.xlabel('Type of Pollutant', fontsize=14, fontweight="bold")
plt.title('Average Hourly Pollution by Pollutant', fontsize=20, fontweight="bold")
Insights:
The heatmap above gives a general idea of how the pollution level varies across the 24 hours of the day for each pollutant.
It can be observed that pollution levels are mostly high during the evening and moderate during the morning.
The colour denotes the pollution level: dark blue denotes the lowest and red the highest.
fig = go.Figure(data=go.Scatter(x=avgbyTime.index,
y=avgbyTime["SO2"],
mode='markers'))
fig.update_layout(title='<b>Pollution due to Sulphur Dioxide over the time</b>', yaxis_zeroline=False, xaxis_zeroline=False,
xaxis_title="Time of the Day", yaxis_title="Level of Sulphur Dioxide")
fig.update_traces(mode='markers', marker_line_width=2, marker_size=10)
fig.show()
Insights:
From the scatter plot above we can see a general trend: the level of the SO2 pollutant starts increasing around 8 AM and continues to rise until 9-10 PM in the evening.
SO2 is one of the major constituents of vehicular pollution, so such a trend during those hours of the day is expected.
avgbyCity=AQ_df.groupby(["city","pollutant_id"], as_index=False)["pollutant_avg"].mean()
avgbyCity=avgbyCity.pivot(index="city", columns="pollutant_id", values="pollutant_avg")
From the summary tables below we can identify the top 5 cities with the highest average pollution for each type of pollutant.
top5_CO=pd.DataFrame(avgbyCity.sort_values("CO", ascending=False)["CO"].head(5))
top5_CO=top5_CO.round(2)
fig = go.Figure(data=[go.Table(header=dict(values=['<b>City</b>', '<b>Average Pollution</b>']),
cells=dict(values=[top5_CO.index,top5_CO.values]))
])
fig.update_layout(
title={
'text': "<b>Top 5 Cities with Highest Average CO Pollutant</b>"}, width=500, height=500,)
fig.show()
top5_NO2=pd.DataFrame(avgbyCity.sort_values("NO2", ascending=False)["NO2"].head(5))
top5_NO2=top5_NO2.round(2)
top5_NO2
fig = go.Figure(data=[go.Table(header=dict(values=['<b>City</b>', '<b>Average Pollution</b>']),
cells=dict(values=[top5_NO2.index,top5_NO2.values]))
])
fig.update_layout(
title={
'text': "<b>Top 5 Cities with Highest Average NO2 Pollutant</b>"}, width=500, height=500,)
fig.show()
top5_SO2=pd.DataFrame(avgbyCity.sort_values("SO2", ascending=False)["SO2"].head(5))
top5_SO2=top5_SO2.round(2)
fig = go.Figure(data=[go.Table(header=dict(values=['<b>City</b>', '<b>Average Pollution</b>']),
cells=dict(values=[top5_SO2.index,top5_SO2.values]))
])
fig.update_layout(
title={
'text': "<b>Top 5 Cities with Highest Average SO2 Pollutant</b>"}, width=500, height=500,)
fig.show()
top5_PM2_5=pd.DataFrame(avgbyCity.sort_values("PM2.5", ascending=False)["PM2.5"].head(5))
top5_PM2_5=top5_PM2_5.round(2)
fig = go.Figure(data=[go.Table(header=dict(values=['<b>City</b>', '<b>Average Pollution</b>']),
cells=dict(values=[top5_PM2_5.index,top5_PM2_5.values]))
])
fig.update_layout(
title={
'text': "<b>Top 5 Cities with Highest Average PM2.5 Pollutant</b>"}, width=500, height=500,)
fig.show()
top5_PM10=pd.DataFrame(avgbyCity.sort_values("PM10", ascending=False)["PM10"].head(5))
top5_PM10=top5_PM10.round(2)
fig = go.Figure(data=[go.Table(header=dict(values=['<b>City</b>', '<b>Average Pollution</b>']),
cells=dict(values=[top5_PM10.index,top5_PM10.values]))
])
fig.update_layout(
title={
'text': "<b>Top 5 Cities with Highest Average PM10 Pollutant</b>"}, width=500, height=500,)
fig.show()
top5_OZONE=pd.DataFrame(avgbyCity.sort_values("OZONE", ascending=False)["OZONE"].head(5))
top5_OZONE=top5_OZONE.round(2)
fig = go.Figure(data=[go.Table(header=dict(values=['<b>City</b>', '<b>Average Pollution</b>']),
cells=dict(values=[top5_OZONE.index,top5_OZONE.values]))
])
fig.update_layout(
title={
'text': "<b>Top 5 Cities with Highest Average OZONE Pollutant</b>"}, width=600, height=500,)
fig.show()
Particulate Matter (PM) is the major constituent of air pollution in India's Air Quality Index; how exactly did it vary across the days of the last month?
avgbyDay=AQ_df.groupby(["date", "pollutant_id"], as_index=False)["pollutant_avg"].mean()
#avgbyDay gives average daily pollution of each day for every pollutant.
#Pivoting the dataset to plot a timeseries line chart.
avgbyDay=avgbyDay.pivot(index="date", columns="pollutant_id", values="pollutant_avg")
plt.figure(figsize=(15,8))
fig=plt.plot(avgbyDay.index, avgbyDay["PM2.5"], label="PM2.5", marker="o")
fig=plt.plot(avgbyDay.index, avgbyDay["PM10"], label="PM10", marker="o")
plt.xlabel('Date', fontsize=14)
plt.ylabel('Average Pollution', fontsize=14)
plt.title('Average Daily Particulate Matter(PM) Pollution', fontsize=20, fontweight="bold")
plt.xticks(rotation=30)
plt.legend()
plt.show()
Insights:
From the chart above it is evident that the pollution levels due to PM2.5 and PM10 go almost hand in hand. There is a general downward trend in Particulate Matter pollution over the last couple of weeks, roughly since the start of March.
The trend could be due to the rise in COVID-19 cases and the government taking strict measures to contain the spread.
avgbyDay_min=AQ_df.groupby(["date", "pollutant_id"], as_index=False)["pollutant_min"].mean()
avgbyDay_min=avgbyDay_min.pivot(index="date", columns="pollutant_id", values="pollutant_min")
# Averaging the daily minimum and maximum levels of each pollutant.
avgbyDay_max=AQ_df.groupby(["date", "pollutant_id"], as_index=False)["pollutant_max"].mean()
avgbyDay_max=avgbyDay_max.pivot(index="date", columns="pollutant_id", values="pollutant_max")
fig = go.Figure()
fig.add_trace(go.Scatter(x=avgbyDay_min.index, y=avgbyDay_min["PM2.5"], fill=None, name="Min PM2.5")) # fill down to minimum level
fig.add_trace(go.Scatter(x=avgbyDay_min.index, y=avgbyDay_max["PM2.5"], fill='tonexty', name="Max PM2.5")) # fill to maximum level
fig.update_layout(showlegend=True,
title={"text":"<b>Daily Low and High Particulate Matter 2.5(PM2.5) levels in Air</b>"},
xaxis_title="Date", yaxis_title="Level of Particulate Matter Pollution",
xaxis_tickformat = '%d %B<br>%Y')
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=avgbyDay_max.index, y=avgbyDay_min["PM10"], fill=None, name="Min PM10")) # fill down to xaxis
fig.add_trace(go.Scatter(x=avgbyDay_max.index, y=avgbyDay_max["PM10"], fill='tonexty', name="Max PM10")) # fill to trace0 y
fig.update_layout(showlegend=True,
title={"text":"<b>Daily Low and High Particulate Matter 10(PM10) levels in Air</b>"},
xaxis_title="Date", yaxis_title="Level of Particulate Matter Pollution",
xaxis_tickformat = '%d %B<br>%Y')
fig.show()
Insights:
From the area charts above we get a general idea of the minimum and maximum levels of PM2.5 and PM10 throughout each day.
The lower blue line marks the minimum average level of PM2.5 and PM10, whereas the orange line marks the highest level.
The area between the lines gives an estimate of the average Particulate Matter present in the air on a specific day.
x=avgbyDay_min.index
fig = go.Figure()
fig.add_trace(go.Scatter(
x=x, y=avgbyDay_min["CO"],
hoverinfo='x+y',
mode='lines+markers',
line=dict(width=0.8, color='rgb(131, 90, 241)'),
stackgroup='one', # define stack group,
name="CO"
))
fig.add_trace(go.Scatter(
x=x, y=avgbyDay_min["NO2"],
hoverinfo='x+y',
mode='lines+markers',
line=dict(width=0.8, color='rgb(111, 231, 219)'),
stackgroup='one',
name="NO2"
))
fig.add_trace(go.Scatter(
x=x, y=avgbyDay_min["SO2"],
hoverinfo='x+y',
mode='lines+markers',
line=dict(width=0.8, color='rgb(184, 247, 212)'),
stackgroup='one',
name="SO2"
))
fig.add_trace(go.Scatter(
x=x, y=avgbyDay_min["NH3"],
hoverinfo='x+y',
mode='lines+markers',
line=dict(width=0.8, color='rgb(231, 121, 212)'),
stackgroup='one',
name="NH3"
))
fig.update_layout(title={"text":"<b>Percentage of Pollution constituents other than Particulate Matters</b>"},showlegend=True,
yaxis=dict(
type='linear',
range=[1, 100],
ticksuffix='%'),xaxis_tickformat = '%d %B<br>%Y')
fig.show()
Insights:
From the area chart above it is evident that the major constituents of air pollution after Particulate Matter are CO and SO2, whereas NH3 contributes least to the overall average pollution.
top5_CO=pd.DataFrame(avgbyCity.sort_values("CO", ascending=True)["CO"].head(5))
top5_CO=top5_CO.round(2)
top5_NO2=pd.DataFrame(avgbyCity.sort_values("NO2", ascending=True)["NO2"].head(5))
top5_NO2=top5_NO2.round(2)
top5_SO2=pd.DataFrame(avgbyCity.sort_values("SO2", ascending=True)["SO2"].head(5))
top5_SO2=top5_SO2.round(2)
top5_PM2_5=pd.DataFrame(avgbyCity.sort_values("PM2.5", ascending=True)["PM2.5"].head(5))
top5_PM2_5=top5_PM2_5.round(2)
top5_PM10=pd.DataFrame(avgbyCity.sort_values("PM10", ascending=True)["PM10"].head(5))
top5_PM10=top5_PM10.round(2)
top5_OZONE=pd.DataFrame(avgbyCity.sort_values("OZONE", ascending=True)["OZONE"].head(5))
top5_OZONE=top5_OZONE.round(2)
On the gauge indicators below, the green range denotes the best air quality, yellow denotes moderately polluted, red denotes poor, and grey denotes levels at which severe health issues can occur.
The blue bar displays the average pollution level in that city; the large number gives the exact level.
The black tick on the gauge marks the national average for that pollutant.
The delta value below the city's pollution level shows how much lower or higher it is than the national average for that pollutant.
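The colour bands can be read as a simple threshold function. A hypothetical helper (not part of the notebook), using the same band edges as the CO gauge's 'steps' below:

```python
def co_band(value):
    """Map an average CO reading to the gauge band colour (edges match the CO chart)."""
    if value <= 2:
        return "green"   # good air quality
    if value <= 17:
        return "yellow"  # moderately polluted
    if value <= 40:
        return "red"     # poor
    return "gray"        # severe

print(co_band(1.5), co_band(25.0))  # green red
```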
fig1 = go.Figure(go.Indicator(
mode = "delta+number+gauge",
value = top5_CO.values.min(),
title = {'text': str("CO Pollution "+top5_CO["CO"].idxmin()+" city")},
delta = {'reference': avgbyCity["CO"].mean(), 'increasing': {'color': "RebeccaPurple"}},
domain = {'x': [0, 0.25], 'y': [1, 1], 'row':1},
align= "left",
gauge = {'axis': {'range': [None, 50], 'tickwidth': 1, 'tickcolor': "darkblue"},
'bar': {'color': "blue"},
'steps' : [
{'range': [0, 2], 'color': "green"},
{'range': [2, 17], 'color': "yellow"},
{'range': [17, 40], 'color': "red"},
{'range': [40, 50], 'color': "gray"}],
'threshold' : {'line': {'color': "black", 'width': 4}, 'thickness': 0.75, 'value': avgbyCity["CO"].mean()}}
))
fig1.update_layout(autosize=True)
fig1.show()
Insights:
The gauge above identifies the city with the lowest average pollution with respect to CO levels.
fig1 = go.Figure(go.Indicator(
mode = "delta+number+gauge",
value = top5_NO2.values.min(),
title = {'text': str("NO2 Pollution "+top5_NO2["NO2"].idxmin()+" city")},
delta = {'reference': avgbyCity["NO2"].mean(), 'increasing': {'color': "RebeccaPurple"}},
domain = {'x': [0, 0.25], 'y': [1, 1], 'row':1},
align= "left",
gauge = {'axis': {'range': [None, 500], 'tickwidth': 1, 'tickcolor': "darkblue"},
'bar': {'color': "blue"},
'steps' : [
{'range': [0, 80], 'color': "green"},
{'range': [80, 280], 'color': "yellow"},
{'range': [280, 400], 'color': "red"},
{'range': [400, 500], 'color': "gray"}],
'threshold' : {'line': {'color': "black", 'width': 4}, 'thickness': 0.75, 'value': avgbyCity["NO2"].mean()}}
))
fig1.update_layout(
autosize=True)
fig1.show()
Insights:
The gauge above identifies the city with the lowest average pollution with respect to NO2 levels.
fig1 = go.Figure(go.Indicator(
mode = "delta+number+gauge",
value = top5_SO2.values.min(),
title = {'text': str("SO2 Pollution "+top5_SO2["SO2"].idxmin()+" city")},
delta = {'reference': avgbyCity["SO2"].mean(), 'increasing': {'color': "RebeccaPurple"}},
domain = {'x': [0, 0.25], 'y': [1, 1], 'row':1},
align= "left",
gauge = {'axis': {'range': [None, 2000], 'tickwidth': 1, 'tickcolor': "darkblue"},
'bar': {'color': "blue"},
'steps' : [
{'range': [0, 80], 'color': "green"},
{'range': [80, 800], 'color': "yellow"},
{'range': [800, 1600], 'color': "red"},
{'range': [1600, 2000], 'color': "gray"}],
'threshold' : {'line': {'color': "black", 'width': 4}, 'thickness': 0.75, 'value': avgbyCity["SO2"].mean()}}
))
fig1.update_layout(
autosize=True)
fig1.show()
Insights:
The gauge above identifies the city with the lowest average pollution with respect to SO2 levels.
fig1 = go.Figure(go.Indicator(
mode = "delta+number+gauge",
value = top5_PM2_5.values.min(),
title = {'text': str("PM2.5 Pollution "+top5_PM2_5["PM2.5"].idxmin()+" city")},
delta = {'reference': avgbyCity["PM2.5"].mean(), 'increasing': {'color': "RebeccaPurple"}},
domain = {'x': [0, 0.25], 'y': [1, 1], 'row':1},
align= "left",
gauge = {'axis': {'range': [None, 300], 'tickwidth': 1, 'tickcolor': "darkblue"},
'bar': {'color': "blue"},
'steps' : [
{'range': [0, 60], 'color': "green"},
{'range': [61, 120], 'color': "yellow"},
{'range': [120, 250], 'color': "red"},
{'range': [250, 300], 'color': "gray"}],
'threshold' : {'line': {'color': "black", 'width': 4}, 'thickness': 0.75, 'value': avgbyCity["PM2.5"].mean()}}
))
fig1.update_layout(
autosize=True)
fig1.show()
Insights:
The gauge above identifies the city with the lowest average pollution with respect to PM2.5 levels.
fig1 = go.Figure(go.Indicator(
mode = "delta+number+gauge",
value = top5_PM10.values.min(),
title = {'text': str("PM10 Pollution "+top5_PM10["PM10"].idxmin()+" city")},
delta = {'reference': avgbyCity["PM10"].mean(), 'increasing': {'color': "RebeccaPurple"}},
domain = {'x': [0, 0.25], 'y': [1, 1], 'row':1},
align= "left",
gauge = {'axis': {'range': [None, 500], 'tickwidth': 1, 'tickcolor': "darkblue"},
'bar': {'color': "blue"},
'steps' : [
{'range': [0, 100], 'color': "green"},
{'range': [100, 350], 'color': "yellow"},
{'range': [350, 430], 'color': "red"},
{'range': [430, 500], 'color': "gray"}],
'threshold' : {'line': {'color': "black", 'width': 4}, 'thickness': 0.75, 'value': avgbyCity["PM10"].mean()}}
))
fig1.update_layout(
autosize=True)
fig1.show()
Insights:
The gauge above identifies the city with the lowest average pollution with respect to PM10 levels.
The exploratory data analysis above on the Air Quality Index of various Indian cities gives a general overview of how the pollution level varies across the days of the week, and how it changes at different times of day for various pollutants.
In addition, we can identify and rank the most polluted cities and states, and find the cities with the lowest pollution levels.
We were also able to identify the major constituents of air pollution that deteriorate the Air Quality Index.
We also saw a downward trend in recent pollution levels, compared to previous weeks, due to the lockdown in India to prevent community spread of COVID-19.